Ron Neely, Rowan Data Mining 2 Due 2018.09.27, Assigned 2019.09.19
HW: Remember sensitivity analysis from DM1? Essentially, we make an average vector, which is the mean of each variable in the data file. We then compute the model's output with each variable in turn ranging from 0 to 1 (holding the others at their means) and see what the difference is. (In R, sapply(vector, mean) might be helpful.)
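That DM1 recipe can be sketched in a few lines of numpy. The `model_output` function below is a hypothetical stand-in for a trained model's score (a simple weighted sum, so we know feature 1 should dominate):

```python
import numpy as np

# Hypothetical model for illustration: a weighted sum standing in for a
# trained classifier's decision score
weights = np.array([0.1, 0.9, 0.0])

def model_output(x):
    return float(weights @ x)

# Average vector: the mean of each column of the (scaled) data
X = np.random.RandomState(0).rand(100, 3)
avg = X.mean(axis=0)

# Sweep each variable from 0 to 1 while holding the others at their means,
# and record how far the output moves
sensitivity = []
for j in range(X.shape[1]):
    outputs = []
    for v in np.linspace(0.0, 1.0, 11):
        x = avg.copy()
        x[j] = v
        outputs.append(model_output(x))
    sensitivity.append(max(outputs) - min(outputs))

print(sensitivity)  # for this linear toy model, equals the |weights|
```

For a linear model the swing of each sweep is exactly the magnitude of that feature's weight, which makes this a handy sanity check before applying the same loop to a real classifier.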
We will use SALib to analyze sensitivity: SALib Tutorial
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import KFold
# pip install SALib # https://github.com/SALib/SALib
import SALib.sample.saltelli as ss
import SALib.analyze.sobol as sa
import SALib.plotting.morris as mp
from SALib.sample import morris as ms
import SALib.analyze.morris as ma
import sys
sys.version
def sklearn_to_df(sklearn_dataset):
df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
df['y'] = pd.Series(sklearn_dataset.target)
return df
df = sklearn_to_df(datasets.load_breast_cancer())
print(df.shape)
print(df.dtypes, end=' ')
df.head(3)
y = df['y']
print(len(y))
print(y.dtype)
y.head(3)
X_cols = df.columns[0:-1]
scaler = StandardScaler()
df[X_cols] = scaler.fit_transform(df[X_cols])
X = df[X_cols]
print(X.shape)
X.head(3)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
Looks like the default split is 75% train, 25% test.
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))
mlp.fit(X_train,y_train)
y_pred = mlp.predict(X_test)
print("            pred\n            0   1\nactual 0 [[tn, fp]\nactual 1  [fn, tp]]")
print(confusion_matrix(y_test, y_pred), "\n")
print("\n")
print(classification_report(y_test, y_pred))
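As a sanity check on reading the matrix: sklearn's confusion_matrix puts actual labels on the rows and predicted labels on the columns, with labels sorted ascending. A numpy re-derivation of that layout on a tiny made-up example:

```python
import numpy as np

# sklearn's confusion_matrix layout for 0/1 labels: rows = actual,
# columns = predicted, so the matrix reads [[tn, fp], [fn, tp]]
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1
print(cm)  # [[1 1]
           #  [1 2]]
```

Here one 0 was predicted correctly (tn=1), one 0 was predicted as 1 (fp=1), one 1 as 0 (fn=1), and two 1s correctly (tp=2).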
g = sns.pairplot(df)
g.savefig("breast_cancer_pairs.png")
Graphically, looking at the last row (y), the following input factors seem to have an effect: mean radius, mean perimeter, mean concave points, worst radius, and worst concave points. However, it is hard to judge correlation with the output visually because the output is binary. There are clearly strong correlations among the input variables themselves.
Let's use Sobol Sensitivity Analysis from SALib per SALib Tutorial.
We are doing the sensitivity analysis on original X and y data. This gives us a view of the problem space even before we begin building a model.
SALib is quirky. It requires the number of output samples to be a multiple of (number of features + 2). We have 30 features, so the following calculates the longest such length we can use from our records.
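That length calculation can be written out directly (the 569 records and 30 features come from the breast cancer data above):

```python
# Largest number of records that is an exact multiple of (num_features + 2)
n_records, n_features = 569, 30
block = n_features + 2                      # 32
length = (n_records // block) * block       # truncate to a whole number of blocks
print(length)  # 544
```

Integer division drops the 25 leftover records that would make the sample length fall off the 32-record boundary SALib expects.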
problem = {
'num_vars': 30,
'names' : list(X_cols),
'bounds' : [ [min, max] for min, max in zip(X.min(), X.max()) ],
'groups' : None
}
length = (30+2)*17 # 544: largest multiple of 32 (30 features + 2 Saltelli bounds) less than 569 records
sample = ss.sample(problem, length, calc_second_order=False)
si = sa.analyze(problem, y.values[:length], calc_second_order=False)
dfs = pd.DataFrame([list(x) for x in zip(si['S1'], si['ST'], sample.mean(axis=0))],
columns = ["1st", "total", "mean of input"], index = problem['names'])
SALib first-order Sobol results show how sensitive the output is to the variability of each individual input variable.
Features at the top of the following list have the most individual influence.
dfs['1st'].sort_values().plot(kind='barh')
SALib total Sobol results show how sensitive the output is to interactions between input variables.
Features at the top of the following list have the strongest interaction effects.
dfs['total'].sort_values().plot(kind='barh')
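The gap between the total and first-order indices is a common way to read off interaction strength: ST − S1 estimates how much of a feature's influence comes from interactions with other features rather than from varying it alone. A small numpy illustration with made-up indices for three features:

```python
import numpy as np

# Toy first-order (S1) and total (ST) Sobol indices, made up for illustration
s1 = np.array([0.40, 0.10, 0.05])
st = np.array([0.45, 0.35, 0.06])

# ST - S1: the share of each feature's influence due to interactions
interaction = st - s1
print(interaction)  # feature 1's influence is mostly interactive
```

Feature 0 is strong on its own (large S1, small gap), while feature 1's total index is mostly interaction-driven, which is exactly the distinction between the two bar charts above.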
m = RandomForestClassifier()
m.fit(X,y)
dff = pd.DataFrame(m.feature_importances_, columns=["importance"], index=X_cols.values)
dff = dff.sort_values(by='importance', ascending=True)
dff.plot(kind='barh')
SALib Sobol and the Random Forest model both show worst perimeter as one of the top individual factors, but there is not much agreement beyond that. Random Forest gives a much higher precedence to mean concave points, whereas SALib Sobol ranks it much lower. Random Forest seems to do a better job here.
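One way to quantify how much two importance rankings agree is a rank correlation. A numpy-only sketch using Spearman's formula on hypothetical importance scores (both score vectors below are made up for illustration, not taken from the runs above):

```python
import numpy as np

def rankdata(a):
    # simple rank transform (no tie handling needed for this illustration)
    order = np.argsort(a)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(a))
    return ranks

# Hypothetical importance scores for the same five features from two methods
sobol_s1 = np.array([0.30, 0.05, 0.20, 0.10, 0.01])
rf_imp   = np.array([0.25, 0.10, 0.05, 0.30, 0.02])

# Spearman's rho: 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = rank differences
r1, r2 = rankdata(sobol_s1), rankdata(rf_imp)
n = len(r1)
d = r1 - r2
rho = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))
print(rho)  # 0.5: moderate agreement between the two rankings
```

A rho near 1 would mean the two methods order the features almost identically; values near 0 would back up the visual impression that they disagree on most features.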
(Note you can steal my existing files for implementing the NN and SVM on the breast cancer files (see week 1 and 2 content).) You may want to switch back to a non-classification result (i.e., remove the as.factor( ) from the formula to determine sensitivity). How do these compare with the results from part 1?
svm = SVC()
svm.fit(X_train,y_train)
y_pred = svm.predict(X_test)
print("            pred\n            0   1\nactual 0 [[tn, fp]\nactual 1  [fn, tp]]")
print(confusion_matrix(y_test, y_pred), "\n")
print("\n")
print(classification_report(y_test, y_pred))
y_pred = svm.predict(X[:length])
problem = {
'num_vars': 30,
'names' : list(X_cols),
'bounds' : [ [min, max] for min, max in zip(X[:length].min(), X[:length].max()) ],
'groups' : None
}
sample = ss.sample(problem, length, calc_second_order=False)
si = sa.analyze(problem, y_pred, calc_second_order=False)
dfs = pd.DataFrame([list(x) for x in zip(si['S1'], si['ST'], sample.mean(axis=0))],
columns = ["1st", "total", "mean of input"], index = problem['names'])
dfs['1st'].sort_values().plot(kind='barh')
Worst perimeter, smoothness error, and worst texture are still among the most important factors, though the order has changed.
Remember to get the 10th point ask and answer a challenge problem that is at least tangentially connected with this HW.
How would SALib Morris analysis compare to Sobol and Random Forest?
length = int(569/(30 + 1))*31 # 558: largest multiple of 31 (30 features + 1) less than 569 records
length
sample = ms.sample(problem, length, num_levels=4, grid_jump=2)
Si = ma.analyze(problem, sample, y.values[:length].astype(float), print_to_console=False)
fig, (ax1, ax2) = plt.subplots(1, 2)
mp.horizontal_bar_plot(ax1, Si, {})
mp.covariance_plot(ax2, Si, {})
SALib is quirky. Morris analysis didn't work here. Part of the problem is conceptual: Morris (like Sobol) expects the Y values to be model outputs evaluated at the sampled input points, so feeding it the first `length` labels from the dataset does not correspond to the generated sample.
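To show what Morris actually measures, here is a numpy-only sketch of one elementary-effects trajectory on a hypothetical toy function standing in for the trained classifier (this is the idea behind the method, not SALib's implementation):

```python
import numpy as np

# Toy model for illustration: x0 matters a lot, x1 a little,
# and x0 interacts with x2
def toy_model(x):
    return 3.0 * x[0] + 0.5 * x[1] + x[0] * x[2]

rng = np.random.default_rng(0)
delta = 0.5
x = rng.random(3) * (1 - delta)      # base point, leaving room for the step
effects = np.zeros(3)
for j in rng.permutation(3):         # perturb one input at a time, random order
    x_step = x.copy()
    x_step[j] += delta
    # elementary effect: change in output per unit change in input j
    effects[j] = (toy_model(x_step) - toy_model(x)) / delta
    x = x_step                       # trajectory: keep the move
print(np.abs(effects))               # x0's effect dominates
```

A real Morris run averages these elementary effects over many trajectories (that is what `ms.sample` generates), and crucially evaluates the model at every sampled point, which is the step missing from the attempt above.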